@danielcweeks, FYI. This branch adds manifest file lists for snapshots and adds a filter used to skip reading manifest files while planning scans. Please review if you have time.
+1. It would be good to clarify the plan with respect to the manifest list vs. the manifest list location. If we plan to use the list location as primary going forward, we should probably mark the former as deprecated (even if still supported). One comment nit; other than that, it looks good.
This adds a new table property, write.manifest-lists.enabled, that defaults to false. When enabled, new snapshot manifest lists will be written into separate files. The file location will be stored in the snapshot metadata as "manifest-list".
This expression evaluator determines whether a manifest needs to be scanned or whether it cannot contain data files matching a partition predicate.
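As a rough illustration of the idea (not the evaluator added in this branch), the check can be sketched in Python. The `FieldSummary` class and the `might_match_*` helpers are hypothetical names, and partition values are assumed to be integers. The key property is that the test is conservative: it returns False only when the summary proves that no partition in the manifest can match.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSummary:
    """Summary of one partition field across a manifest (hypothetical shape)."""
    contains_null: bool
    lower: Optional[int]  # min partition value, None if every value is null
    upper: Optional[int]  # max partition value, None if every value is null


def might_match_eq(summary: FieldSummary, value: int) -> bool:
    """False only if no partition in the manifest can equal `value`."""
    if summary.lower is None or summary.upper is None:
        return False  # only nulls: equality with a non-null literal cannot match
    return summary.lower <= value <= summary.upper


def might_match_gt(summary: FieldSummary, value: int) -> bool:
    """False only if every partition value is <= `value` (or null)."""
    return summary.upper is not None and summary.upper > value


def might_match_is_null(summary: FieldSummary) -> bool:
    """False only if the manifest has no null value for this field."""
    return summary.contains_null
```

A manifest for which the check returns True may still contain no matching files; it only has to be read to find out.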
This modifies SnapshotUpdate when writing a snapshot with a manifest list file. If manifests going into the manifest list do not have full metadata, then this will scan those manifests to fill it in, including snapshot ID, added/existing/deleted file counts, and partition field summaries.
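A minimal sketch of what such a scan might compute, assuming each manifest entry carries a status and a tuple of integer partition values. The `Entry` class, the status strings, and `summarize` are hypothetical and do not mirror the SnapshotUpdate code. The sketch folds every entry into the partition summaries regardless of status, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ADDED, EXISTING, DELETED = "added", "existing", "deleted"


@dataclass
class Entry:
    """One manifest entry (hypothetical shape)."""
    status: str
    partition: Tuple[Optional[int], ...]


def summarize(entries):
    """Return (counts by status, per-field (contains_null, min, max) tuples)."""
    counts = {ADDED: 0, EXISTING: 0, DELETED: 0}
    fields = []  # one [contains_null, lower, upper] triple per partition field
    for entry in entries:
        counts[entry.status] += 1
        for i, value in enumerate(entry.partition):
            if i == len(fields):
                fields.append([False, None, None])
            f = fields[i]
            if value is None:
                f[0] = True
            else:
                f[1] = value if f[1] is None else min(f[1], value)
                f[2] = value if f[2] is None else max(f[2], value)
    return counts, [tuple(f) for f in fields]
```

The result holds the same kinds of values the manifest list stores per manifest, so later planning can consult the list instead of repeating the scan.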
This optimizes ScanSummary and FileHistory to ignore manifests that cannot have changes in the configured time range.
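One way to picture this pruning, under the assumption that a manifest can only carry changes from the snapshot that added it: keep just the manifests whose added_snapshot_id belongs to a snapshot inside the time range. The function name and argument shapes below are hypothetical and are not taken from ScanSummary or FileHistory.

```python
def manifests_in_range(manifests, snapshot_timestamps, start_ms, end_ms):
    """Return paths of manifests that may contain changes in [start_ms, end_ms].

    manifests: iterable of (manifest_path, added_snapshot_id) pairs
    snapshot_timestamps: dict mapping snapshot ID to commit time in milliseconds
    """
    in_range = {
        snapshot_id
        for snapshot_id, ts in snapshot_timestamps.items()
        if start_ms <= ts <= end_ms
    }
    # Manifests added by snapshots outside the range are never opened.
    return [path for path, snapshot_id in manifests if snapshot_id in in_range]
```

For example, with snapshots 1, 2, and 3 committed at times 100, 200, and 300, a range of 150 to 300 keeps only the manifests added by snapshots 2 and 3.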
Since the review included #3, I merged that first and rebased this. I'll merge this when tests are passing.
This adds a separate file, a manifest list, to track the manifests for a snapshot. The manifest list is an Avro file with a row for each manifest. The columns of each row are used to decide whether a manifest can be skipped, so that manifests do not have to be read just to look for data files.
Columns include:
- manifest_path: path of the manifest file
- partition_spec_id: ID of the partition spec used to write the manifest (depends on "Store multiple partition specs in table metadata", #3)
- added_snapshot_id: snapshot ID when the manifest was added to the table
- added_data_files_count, existing_data_files_count, deleted_data_files_count: counts used to track operations
- partitions: a summary (min, max, and containsNull for each field) of the partitions in the manifest file

Manifest lists are written when the table property write.manifest-lists.enabled is set to true.
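For readers who prefer to see the row as a data structure, here is a hedged Python sketch of one manifest list row. The column names come from the list above. The class names, the integer partition values, the names of the fields inside the partition summary, and the `has_changes` helper are assumptions for illustration, not the Avro schema from this branch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PartitionFieldSummary:
    """min, max, and containsNull for one partition field (field names assumed)."""
    contains_null: bool
    min_value: Optional[int]
    max_value: Optional[int]


@dataclass
class ManifestListRow:
    """One row of a manifest list: metadata about a single manifest file."""
    manifest_path: str
    partition_spec_id: int
    added_snapshot_id: int
    added_data_files_count: int
    existing_data_files_count: int
    deleted_data_files_count: int
    partitions: List[PartitionFieldSummary] = field(default_factory=list)

    def has_changes(self) -> bool:
        """True if the manifest records any added or deleted data files."""
        return self.added_data_files_count > 0 or self.deleted_data_files_count > 0
```

Because every value needed for pruning lives in the row, a planner can iterate over rows and open only the manifests that survive its checks.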
When a manifest list is written, its location is stored in the metadata file in place of a list of manifest locations. The snapshot object includes a "manifest-list" key instead of the "manifests" key.
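The two snapshot shapes can be compared with a small Python sketch. Only the "manifest-list" and "manifests" keys are taken from the description above. The other keys, the file paths, and the `manifest_paths` and `read_manifest_list` names are illustrative assumptions, and reading the Avro file is left to a caller-supplied function.

```python
# Snapshot that lists manifest locations inline (illustrative values).
snapshot_with_manifests = {
    "snapshot-id": 1,
    "manifests": ["/tbl/metadata/m1.avro", "/tbl/metadata/m2.avro"],
}

# Snapshot that points to a separate manifest list file instead.
snapshot_with_list = {
    "snapshot-id": 2,
    "manifest-list": "/tbl/metadata/snap-2.avro",
}


def manifest_paths(snapshot, read_manifest_list):
    """Return manifest locations for either snapshot shape.

    read_manifest_list is a caller-supplied function that takes the manifest
    list location and returns its rows as dicts with a "manifest_path" key.
    """
    if "manifest-list" in snapshot:
        rows = read_manifest_list(snapshot["manifest-list"])
        return [row["manifest_path"] for row in rows]
    return list(snapshot["manifests"])
```

A reader that handles both keys this way keeps working with snapshots written before the property was enabled.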